Working on projects in R is typically an iterative experience. You write a function, record the results, explore the results with plots, improve your functions, rinse and repeat. Repeat for just a little while and you end up with an environment filled with forgotten names and a script half a mile long (speaking from personal experience).

This iterative process is especially true for QTL mapping experiments in R. Even with high-quality data in hand, it may not be obvious what the proper mapping technique you should use, or what parameters should be used for a given technique. The process of choosing the appropriate model and model parameters has to instead be tailored to the organism, trait, and structure of the data for each experiment on a case-by-case basis.

In practice, this involves fitting many models, and inspecting the results of each model and how they compare to one another. Keeping track of the models alone can quickly become difficult, let alone how they compare with one another.

One option to alleviate some of these challenges is to use workflow management software which provide an organizational framework for projects in R (and other languages). One such package in R is targets. In the authors words:

The targets package is a Make-like pipeline toolkit for Statistics and data science in R. With targets, you can maintain a reproducible workflow without repeating yourself. targets skips costly runtime for tasks that are already up to date, runs the necessary computation with implicit parallel computing, and abstracts files as R objects. A fully up-to-date targets pipeline is tangible evidence that the output aligns with the code and data, which substantiates trust in the results.

In general, using these management tools allow researchers to efficiently manage, understand, and communicate the results of projects that would be frustrating for the programmer to keep track of, and difficult for someone new to the code to understand. Furthermore, these packages are designed to ensure that the results obtained are reproducible so that both you and anyone you share the code with can have confidence in the consistency of the results you obtain.

The targets package also has an ecosystem of side “helper” packages that augment its base capabilities. These include tflow to easily set up projects that use targets, fnmate to define functions in targets projects with simple shortcuts, and the targetopia collection of target-adjacent packages for more specialized applications.

Beyond providing an organizational framework, targets steers users towards a workflow which makes heavy use of functional programming. This allows users to create analysis plans that are both compact and very flexible.

I’ve become very fond these packages, and end up using them in just about every project I do beyond the most trivial of jobs. The goal of this document is to both show how to use the targets package to implement a QTL mapping pipeline, and demonstrate why I like using it and its related packages so much.

This blog post by the author of the tflow and fnmate packages is an especially good (and accessible) overview of the benefits of using these workflows, and how to implement them. It almost single-handedly convinced me to start using these packages. The drake package mentioned in the post is however the predecessor of the targets package. The syntax has changed a bit between the two packages, but the overall goals and design has remained largely the same. Theres even a chapter in the targets manual dedicated to explaining the differences between the two.

Some other potentially helpful links are this general overview

QTL Mapping

Before I go into the actual implementation, I’ll (briefly) go over a very general QTL mapping workflow.

Techniques

r/qtl provides several techniques for performing QTL mapping. Broadly, these may be broken down in terms of those designed to estimate the effects of single QTL, and those designed to estimate the effects of multiple QTL which act simultaneously to influence some trait. In r/qtl the single QTL methods are implemented with the scanone function, and the multi-QTL methods are implemented with the cim, stepwiseqtl , and mqmscan functions for composite interval mapping , multiple interval mapping through forward/backward search, and multiple QTL mapping (MQM) respectively.

A layer of difficulty is added by the multi-QTL models as each has a number of parameters which must be supplied by a user and it is not always obvious what the best choice should be. As an example, in composite interval mapping, a user must supply the number of marker cofactors to use in the analysis, and a window size (in cM). Appropriate values to choose for these parameters is not always clear, and it is typically best to try different combinations, and compare the results you get.

Significance Thresholds

Once a scan is performed with some technique, a user then has to decide what qualifies as a significant QTL. Evidence for a QTL is typically given as a logarithm of odds (LOD) score where higher values indicate more evidence in favor of the presence of a QTL at some location. To determine an appropriate threshold for declaring significance, a permutation test is typically performed. This involves randomizing the phenotype data against the genotype data, performing a scan for QTL, and recording the highest LOD score obtained. By repeating this procedure many times, a user can generate a distribution of LOD scores that you would expect to see if there was no association between the phenotypes and genotypes for your specific data set. A user can the select a percentile of this distribution and declare any QTL with a LOD score above that threshold to be significant.

Estimating effects

Once a set of significant QTL has been found, the next step is

General workflow

Below I made a simple flowchart to show the general steps for the QTL mapping analysis. These are the general steps that I will incorporate into the workflow I implement with targets. In practice, the analysis flow chart itself will be more complicated, but I’ll get to that in a bit.

You can see that the main differences between the single and multi-QTL mapping paths stem from the fact that you have to evaluate the effect of the different parameters supplied to the model. This isn’t to say that using the single-QTL mapping techniques don’t involve specifying important model parameters, but in most cases, a researcher will have a good idea what an appropriate choice would be from the onset of the experiment in contrast to the parameters for the multi-QTL techniques that are more subjective.

Working with targets

First, just in case you haven’t read this blog post yet, do it.

With that general workflow above, the next part is how to actually go about implementing it in code. Translating a conceptual flowchart like the one above to code in targets is actually (relatively) straightforward. Briefly, targets requires you to have a specially formatted “master” script file that tells R what tasks you want to perform, and how you will perform them. The “what” part is just a name, and the “how” part given through function calls. A target is essentially just a name a function, and a workflow is a series of these targets that you tie together by using the output of one target as the input to another. In terms of the flowchart above, the “what” is the name of the shapes, and the “how” is what you do to move from one shape to the next. targets then works by analyzing the code in your master script to build a directed graph of analysis tasks. When you run a workflow, this graph is then used to calculate tasks in the order in which they show up. This graph structure is also what targets uses to keep an analysis up to date. If you change a function that is used in an early target, you can see what output is affected by seeing what downstream targets are connected to the target that uses the changed function as a command.

There’s a few ways to go about implementing a workflow in targets, but I like using the tar_plan function which comes from the tarchetypes package. Briefly, this function wraps a series of target definitions using a special format. To give an example, here’s what the QTL mapping plan looks like with just the first four targets. tar_target is the fuction you use to define a target. The first argument to the function is the name of the target, the second is the “command”, what the target needs to do. There’s many other available arguments, but you can do a whole lot with just these two.

For now, ignore all the text in the script before the tar_plan call. In the plan below, the first target is named InputFile. It just defines where I have the .csv cross stored on my computer. The following targets read in the cross, perform QTL mapping with single marker regression and perform permutations with single marker regression. There’s a couple things to note here. First, you can see the special format argument in the first target. This lets the workflow know that this target is an external file. In the second target, I used a different way to name the target. This is the naming scheme from drake where you assign the output of the target command to the target name. In practice, this is equivalent too the following two target definitions which have the name in the first argument position of the function, and the command given by the custom function call in the second position.


A simple plan (click the triangle to show/hide)
## Load your packages, e.g. library(targets).
source("./packages.R")

## Load your R files
lapply(list.files("./R", full.names = TRUE), source)

## tar_plan supports drake-style targets and also tar_target()
tar_plan(
  
  # The input cross data, already in the r/qtl "csv" format
  tar_target(InputFile, 
             "Data/ExampleCross.csv",
             format = "file"),
  
  # Read the QTL file in with r/qtls "read.cross" function. Also apply
  # a few basic operations to clean the initial cross
  QTLData = read_cross(QTL_File = InputFile),
  
  # Marker regression mapping
  tar_target(RegressionResults,
             Map_Regression(CrossData = QTLData,
                                Pheno = phenames(QTLData))),
  
  # Permutations for the marker regression
  tar_target(RegressionPerms,
             Map_Regression_Perms(CrossData = QTLData,
                                  Pheno     = phenames(QTLData),
                                  nPerm     = 100))
  
)


To give more detail, look at what the (barely) custom Map_Regression function from the third target (RegressionResults) looks like.


marker regression function
Map_Regression <- function(CrossData = QTLData, Pheno = phenames(QTLData)) {

  # Just a wrapper for scanone with the method set, and arguments 
  # supplied in names defined in the workflow
  scanone(cross     = CrossData, 
          pheno.col = Pheno,
          method    = "mr")

}


You can see why I say “barely”. This function is just a wrapper for the scanone function from r/qtl with the method set to marker regression, and slightly different argument names. You could just as easily use the scanone function alone, I only wrapped the function so that it would be more obvious what I was doing when I read through my plan.

Now would be a good time to introduce another helpful function. At any point, you can see the underlying graph that drives the workflow with the tar_visnetwork function. Here is what this network looks like right now with just these four targets.

<!DOCTYPE html> visNetwork

This is a pretty straightforward workflow, and it’s not very different than what you would get by writing a standard R script. Things get more interesting when we start to fit many models, and compare their results.

The cim function from r/qtl is used to perform composite interval mapping. This function has two main parameters: the number of marker cofactors, and the window size. When performing a qtl mapping experiment, you might want to try different combinations of these parameters and inspect the results of each combination. This is where targets really starts to shine.

Here’s what the plan looks like after I added some functions to perform composite interval mapping with many different parameter combinations.


The plan with composite interval mapping added
## Load your packages, e.g. library(targets).
source("./packages.R")

## Load your R files
lapply(list.files("./R", full.names = TRUE), source)

AllPhenos <- c("EarHeight",
               "LeafLength",
               "LeafWidth",
               "CobWeight",
               "TotalKernelVolume")

# Combinations of parameters for composite interval mapping.
# Useful early on for checking the effects of different model parameters on the output
CIM_ModelParams <- expand_grid(
  CIM_n.marcovar = c(5, 7, 10),
  CIM_Window     = c(10, 15, 20),
  PhenoNames     = AllPhenos
)

## tar_plan supports drake-style targets and also tar_target()
tar_plan(
  
  # The input cross data, already in the r/qtl "csv" format
  tar_target(InputFile, 
             "Data/ExampleCross.csv",
             format = "file"),
  
  # Read the QTL file in with r/qtls "read.cross" function. Also apply
  # a few basic operations to clean the initial cross
  QTLData = read_cross(QTL_File = InputFile),
  
  # Marker regression mapping and permutations
  tar_target(RegressionResults,
             Map_Regression(CrossData = QTLData,
                                Pheno = phenames(QTLData))),
  
  tar_target(RegressionPerms,
             Map_Regression_Perms(CrossData = QTLData,
                                  Pheno     = phenames(QTLData),
                                  nPerm     = 100)),
  

  # Simple interval mapping and permutations
  tar_target(SimpleIntervalResults,
             Map_SimpleInterval(CrossData   = QTLData,
                                      Pheno = phenames(QTLData))),
  
  tar_target(SimpleIntervalPerms,
             Map_SimpleInterval_Perms(CrossData = QTLData,
                                      Pheno     = phenames(QTLData),
                                      nPerm     = 100)),
  
  # Composite interval mapping
  # Perform composite interval mapping with each combination of phenotype,
  # number of marker covariates, and window size as specified above
  CIM_Mapped <- tar_map(
    
    values = CIM_ModelParams,
    
    tar_target(CompositeIntervalResults,
               Map_Composite(CrossData  = QTLData,
                             Pheno      = PhenoNames,
                             nCovar     = CIM_n.marcovar,
                             WindowSize = CIM_Window,
                             perm       = F))
    
  )
)


A first thing to notice are the AllPhenos vector and the CIM_ModelParams tibble before the call to tar_plan. AllPhenos is a vector which says what phenotypes I want to perfrom QTL mapping on, and the CIM_ModelParams is a table that holds every combination of model parameters (number of marker cofactors, window size, and phenotype) that I want to try. In this case, I used the expand_grid function to return a tibble with a row for every combination of the three paramater vectors. I supplied three numbers of marker cofactors, three window sizes, and 5 phenotypes, so a total of 45 (3 x 3 x 5) models will be fit.

Normally, this would be relatively cumbersome to implement in base R, and even harder to keep the results organized. In targets, a solution to this repetition of an analysis task with different parameter combinations is already provided through branching. In short, this allows you to supply a target with a data structure that holds named parameter values which can then be accessed by a function with arguments that refer to these parameter values by their names.

A user can then define a pattern when using dynamic branching, or use helper functions from tarchetypes when using static branching to describe how the values in this data structure should be used to define a new set of targets.

In that plan above, I give the parameter combinations in the CIM_ModelParams tibble up front, and actually define the new targets in the CIM_Mapped step. The tar_map function I use there is a special helper function from tarchetypes. This uses similar logic to the map function from purrr where the function you give in the target definition(s) is applied to each parameter(s) you provide with the value argument. In the example above, you can see that the CompositeIntervalResults target calls the Map_Composite function which takes as arguments a cross object, a number of marker cofactors, a window size, and a phenotype. The important thing to note is the values given for these arguments match the names of the paramaters given in the CIM_ModelParams tibble, with the exception of the cross object which is provided through the name of the upstream target which holds the cross object. This is how targets figures out which targets need to be created, and what commands (function calls) need to be performed for each.

At this point, it may be helpful to check up on what the graph looks like now.
<!DOCTYPE html> visNetwork


With just a few lines,

Data overview

Reproducibility

Reproducibility receipt
## [1] "2021-03-23 13:56:24 EDT"
## Local:    main C:/Users/Jay/Desktop/Documents/R/QTL_targets_workflow
## Remote:   main @ origin (https://github.com/jhgille2/QTL_targets_workflow.git)
## Head:     [ad0f8d8] 2021-03-23: Added CIM network plot, added to the analysis writeup.
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] DiagrammeR_1.0.6.1   echarts4r_0.3.3      LinkageMapView_2.1.2
##  [4] qtlDesign_0.941      magrittr_2.0.1       here_1.0.1          
##  [7] ASMap_1.0-4          lattice_0.20-41      RColorBrewer_1.1-2  
## [10] fields_11.6          spam_2.6-0           dotCall64_1.0-0     
## [13] gtools_3.8.2         forcats_0.5.1        stringr_1.4.0       
## [16] dplyr_1.0.4          purrr_0.3.4          readr_1.4.0         
## [19] tidyr_1.1.2          tibble_3.0.6         ggplot2_3.3.3       
## [22] tidyverse_1.3.0      qtl_1.47-9           rmarkdown_2.6       
## [25] tarchetypes_0.0.4    targets_0.1.0.9000   dotenv_1.0.2        
## [28] conflicted_1.0.4    
## 
## loaded via a namespace (and not attached):
##  [1] fs_1.5.0          lubridate_1.7.9.2 httr_1.4.2        rprojroot_2.0.2  
##  [5] tools_4.0.2       backports_1.2.1   utf8_1.1.4        R6_2.5.0         
##  [9] DBI_1.1.0         colorspace_2.0-0  withr_2.4.1       tidyselect_1.1.0 
## [13] processx_3.4.5    git2r_0.27.1      compiler_4.0.2    cli_2.3.1        
## [17] rvest_0.3.6       xml2_1.3.2        scales_1.1.1      callr_3.5.1      
## [21] digest_0.6.27     pkgconfig_2.0.3   htmltools_0.5.1.1 dbplyr_2.0.0     
## [25] fastmap_1.1.0     maps_3.3.0        htmlwidgets_1.5.3 rlang_0.4.10     
## [29] readxl_1.3.1      rstudioapi_0.13   shiny_1.6.0       visNetwork_2.0.9 
## [33] generics_0.1.0    jsonlite_1.7.2    Rcpp_1.0.6        munsell_0.5.0    
## [37] fansi_0.4.2       lifecycle_1.0.0   yaml_2.2.1        stringi_1.5.3    
## [41] parallel_4.0.2    promises_1.2.0.1  crayon_1.4.1      haven_2.3.1      
## [45] hms_1.0.0         knitr_1.31        ps_1.5.0          pillar_1.5.0     
## [49] igraph_1.2.6      codetools_0.2-16  reprex_0.3.0      glue_1.4.2       
## [53] evaluate_0.14     data.table_1.13.6 modelr_0.1.8      vctrs_0.3.6      
## [57] httpuv_1.5.5      cellranger_1.1.0  gtable_0.3.0      assertthat_0.2.1 
## [61] cachem_1.0.3      xfun_0.21         mime_0.10         xtable_1.8-4     
## [65] broom_0.7.5       later_1.1.0.1     memoise_2.0.0     ellipsis_0.3.1